-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update predeploy to restart old VMSS when service secrets rotated #2946
update predeploy to restart old VMSS when service secrets rotated #2946
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick comment I'd like some discussion on.
4783d36
to
4ff4541
Compare
pkg/deploy/predeploy.go
Outdated
} | ||
|
||
// wait for load balancer probe to change the health status | ||
time.Sleep(30 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to sleep here, or is the context.WithTimeout
on line 546 sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand, context.withTimeout is to exit from the long running reporting of unhealthy status, but here after the aro-rp restart there's a chance that rp-probe remains healthy as probe runs with interval 15 seconds and threshold 2. That is rp-probe will start reflecting correct relevant health status for vmss after 30 seconds of restart.
Please correct me if there's something wrong with the understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i mentioned, I am not entirely sure we need this check. The thing we should check is the output of the command we are running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what @rajdeepc2792 is saying and I see why we would need this time.Sleep
. I think we may need to adjust the amount of time though. Two reasons:
- I don't think you accounted for the "timeout period" mentioned here: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#probe-interval. I feel like the docs aren't super clear about what the timeout period really is, but it seems like we can assume it is at most 10 seconds. So reusing the logic you used to get to 30 seconds, we would need a sleep of (15 second interval + 10 second timeout period) * 2 probes = 50 seconds if we were to account for the timeout period.
- I was checking out the health probe on the rp-lb in prod eastus to better understand how it was configured, and I noticed a warning about how there is an Azure bug where the
numberOfProbes
will always be 1 regardless of the configured value. You can find this bug documented on Azure documentation here: https://learn.microsoft.com/en-us/azure/load-balancer/whats-new#known-issues. So with that in mind, we would actually only need 15 second interval + 10 second timeout period = 25 seconds before the health status of the RP is accurately reflected.
So in summary, I think 25 seconds is long enough, but keeping it at 30 seconds won't hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kimorris27 for the detailed reasoning, I agree with your analysis. Also to add as pointed out in the comment the rp-probe
configuration is set here, and as the known-issues link suggests the azure product team is working on it, the day it is fixed the waitForReadiness check might stop reflecting the real probe status if kept the timeout to ~30seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. It may be worth leaving a comment in this code with a link to info about the bug. Or maybe there's a better way to document this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small suggestion, PR looks good.
cdd3b55
to
b8198aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rajdeepc2792 this is nice work! I have a small find, that should improve the last check. Would you please look into it?
pkg/deploy/predeploy.go
Outdated
} | ||
|
||
// wait for load balancer probe to change the health status | ||
time.Sleep(30 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i mentioned, I am not entirely sure we need this check. The thing we should check is the output of the command we are running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No Unit Tests.
You should have some.
The code needs some simplification, and I am not sure we want to restart all services concurrently.
b8198aa
to
59b7f43
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but agreed it would be nice to have some tests before merge.
59b7f43
to
a1078f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rajdeepc2792 for all of your work with the tests. This test coverage goes above and beyond. I think some of the repetition could be reduced and would like others who are better at reviewing tests to have a look as well.
Comments addressed
Which issue this PR addresses:
Jira - https://issues.redhat.com/browse/ARO-3240
What this PR does / why we need it:
Here's the error scenario described:-
Deployment
Cluster creation
Suggested Fix:-
Test plan for issue:
Reproduced issue in dev using full-RP deployment:-
make deploy
using changes:- rotated the encryption keys and skipped new VMSS deploymentchacha20poly1305
error.Performed the above step again, with partial fix:-
make deploy
using changes:- rotated the encryption keys, restarted aro-rp in the old rp VMSS and skipped new VMSS deployment.chacha20poly1305
error, this time due to aro-gateway not restarted.Performed the above step again, with complete fix:-
make deploy
using changes:- rotated the encryption keys, restarted aro-gateway in the old gateway VMSS, restarted aro-rp in the old rp VMSS, and skipped new VMSS deployment.make deploy
in between the cluster creation again, restarting the aro-rp and aro-gateway services.Unit Tests added for predeploy.go functions.
Is there any documentation that needs to be updated for this PR?
No
There's a need to watch the lag in rp restart during release in prod because of #157.
If the lag is frequent and impacts the release process time there's a need to come up with an alternative approach to this PR.